
Topic modeling is a machine learning technique that automatically analyzes text data to discover the clusters of words (topics) that best describe a set of documents. This is known as ‘unsupervised’ machine learning because it requires neither a predefined list of tags nor training data that has been previously classified by humans.
I chose 3 topic modeling techniques: Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and BERTopic.
I will implement and compare those topic modeling techniques on the 20 newsgroups dataset, which is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
# Importing general libraries
import re
import numpy as np
import pandas as pd
from pprint import pprint
# Importing the Gensim library
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
# I will use this library for implementing the truncated singular value decomposition for the LSA model
from gensim.models import LsiModel
# Importing nltk and downloading stopwords
import nltk
nltk.download('stopwords')
# Importing spacy for lemmatization
import spacy
# Importing the BERTopic model
from bertopic import BERTopic
# Importing the sentence-transformers package for the purpose of document embeddings
from sentence_transformers import SentenceTransformer
# Importing UMAP for dimensionality reduction in the BERTopic model
import umap
# Importing HDBSCAN to perform its clustering
import hdbscan
# Importing various dimensionality reduction and clustering techniques
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Importing LexRank, an unsupervised approach to text summarization based on graph-based centrality scoring of sentences
from lexrank import *
# Importing the torch package
import torch
# Importing plotting tools
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline
# Enabling logging for gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)
# Importing warnings
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
[nltk_data] Downloading package stopwords to [nltk_data] C:\Users\Yoni\AppData\Roaming\nltk_data... [nltk_data] Package stopwords is already up-to-date!
# Importing NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
# Importing Dataset
df = pd.read_json('newsgroups.json')
print(df.target_names.unique())
df.head(10)
['rec.autos' 'comp.sys.mac.hardware' 'comp.graphics' 'sci.space' 'talk.politics.guns' 'sci.med' 'comp.sys.ibm.pc.hardware' 'comp.os.ms-windows.misc' 'rec.motorcycles' 'talk.religion.misc' 'misc.forsale' 'alt.atheism' 'sci.electronics' 'comp.windows.x' 'rec.sport.hockey' 'rec.sport.baseball' 'soc.religion.christian' 'talk.politics.mideast' 'talk.politics.misc' 'sci.crypt']
| | content | target | target_names |
|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | comp.graphics |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space |
| 5 | From: dfo@vttoulu.tko.vtt.fi (Foxvog Douglas)\... | 16 | talk.politics.guns |
| 6 | From: bmdelane@quads.uchicago.edu (brian manni... | 13 | sci.med |
| 7 | From: bgrubb@dante.nmsu.edu (GRUBB)\nSubject: ... | 3 | comp.sys.ibm.pc.hardware |
| 8 | From: holmes7000@iscsvax.uni.edu\nSubject: WIn... | 2 | comp.os.ms-windows.misc |
| 9 | From: kerr@ux1.cso.uiuc.edu (Stan Kerr)\nSubje... | 4 | comp.sys.mac.hardware |
# Converting to list
data = df.content.values.tolist()
# Removing Emails
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
# Removing new line characters
data = [re.sub(r'\s+', ' ', sent) for sent in data]
# Removing distracting single quotes
data = [re.sub("'", "", sent) for sent in data]
pprint(data[:1])
['From: (wheres my thing) Subject: WHAT car is this!? Nntp-Posting-Host: ' 'rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: ' '15 I was wondering if anyone out there could enlighten me on this car I saw ' 'the other day. It was a 2-door sports car, looked to be from the late 60s/ ' 'early 70s. It was called a Bricklin. The doors were really small. In ' 'addition, the front bumper was separate from the rest of the body. This is ' 'all I know. If anyone can tellme a model name, engine specs, years of ' 'production, where this car is made, history, or whatever info you have on ' 'this funky looking car, please e-mail. Thanks, - IL ---- brought to you by ' 'your neighborhood Lerxst ---- ']
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True removes punctuation
data_words = list(sent_to_words(data))
print(data_words[:1])
[['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']]
# Building the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])
['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp_posting_host', 'rac_wam_umd_edu', 'organization', 'university', 'of', 'maryland_college_park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front_bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']
# Defining functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
# Removing Stop Words
data_words_nostops = remove_stopwords(data_words)
# Forming Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)
# Initializing the spacy 'en' model, keeping only the tagger component (for efficiency)
# python3 -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
# Performing lemmatization, keeping only nouns, adjectives, verbs and adverbs
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized[:1])
[['s', 'thing', 'car', 'nntp_poste', 'host', 'park', 'line', 'wonder', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'tellme', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]
# Creating Dictionary
id2word = corpora.Dictionary(data_lemmatized)
# Creating Corpus
texts = data_lemmatized
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# Viewing the Term Document Frequency
print(corpus[:1])
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1)]]
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]
[[('addition', 1),
('body', 1),
('bring', 1),
('call', 1),
('car', 5),
('day', 1),
('door', 2),
('early', 1),
('engine', 1),
('enlighten', 1),
('funky', 1),
('history', 1),
('host', 1),
('info', 1),
('know', 1),
('late', 1),
('lerxst', 1),
('line', 1),
('look', 2),
('mail', 1),
('make', 1),
('model', 1),
('name', 1),
('neighborhood', 1),
('nntp_poste', 1),
('park', 1),
('production', 1),
('really', 1),
('rest', 1),
('s', 1),
('see', 1),
('separate', 1),
('small', 1),
('spec', 1),
('sport', 1),
('tellme', 1),
('thank', 1),
('thing', 1),
('wonder', 1),
('year', 1)]]
Latent Dirichlet Allocation (LDA) is a generative statistical model that explains a set of observations through unobserved groups, where each group explains why some parts of the data are similar. In LDA, the observations (e.g., words) are collected into documents, each word's presence is attributable to one of the document's topics, and each document is assumed to contain a small number of topics. LDA is one of the most popular topic modeling methods.
# Building the LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
id2word=id2word,
num_topics=20,
random_state=100,
update_every=1, # Determines how often the model parameters should be updated
chunksize=100, # The number of documents to be used in each training chunk
passes=10, # Total number of training passes
alpha='auto',
per_word_topics=True)
# Printing the keywords of the 20 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
[(0, '0.024*"kill" + 0.023*"live" + 0.021*"death" + 0.017*"die" + ' '0.017*"physical" + 0.015*"center" + 0.014*"bike" + 0.014*"attack" + ' '0.012*"israeli" + 0.012*"fire"'), (1, '0.621*"ax" + 0.018*"slow" + 0.014*"brain" + 0.014*"review" + 0.012*"mb" + ' '0.011*"clipper_chip" + 0.010*"sc" + 0.010*"printer" + 0.009*"box" + ' '0.008*"mouse"'), (2, '0.075*"space" + 0.063*"gun" + 0.022*"launch" + 0.021*"earth" + ' '0.019*"firearm" + 0.017*"orbit" + 0.017*"mission" + 0.017*"series" + ' '0.015*"vehicle" + 0.015*"year"'), (3, '0.150*"com" + 0.048*"mount" + 0.046*"apple" + 0.037*"ram" + ' '0.026*"corporation" + 0.025*"frame" + 0.025*"task" + 0.022*"spring" + ' '0.020*"locate" + 0.019*"spacecraft"'), (4, '0.024*"evidence" + 0.019*"believe" + 0.016*"claim" + 0.016*"reason" + ' '0.014*"man" + 0.014*"exist" + 0.012*"sense" + 0.012*"book" + 0.012*"life" + ' '0.011*"faith"'), (5, '0.024*"thank" + 0.024*"line" + 0.019*"program" + 0.018*"file" + ' '0.017*"mail" + 0.017*"system" + 0.014*"card" + 0.014*"include" + ' '0.014*"send" + 0.013*"run"'), (6, '0.322*"drive" + 0.080*"disk" + 0.054*"scsi" + 0.036*"gateway" + ' '0.035*"motherboard" + 0.015*"bank" + 0.015*"please_respond" + ' '0.014*"greatly_appreciate" + 0.012*"fast" + 0.012*"n"'), (7, '0.099*"nhl" + 0.070*"cop" + 0.026*"enable" + 0.025*"police" + 0.020*"plot" ' '+ 0.018*"conservative" + 0.015*"row" + 0.014*"neat" + 0.014*"closely" + ' '0.011*"sharp"'), (8, '0.073*"directory" + 0.061*"battery" + 0.027*"phase" + 0.019*"consult" + ' '0.016*"sustain" + 0.014*"weeks_ago" + 0.013*"scott_roby" + 0.010*"ave" + ' '0.009*"space_shuttle" + 0.009*"powerbook"'), (9, '0.196*"window" + 0.058*"do" + 0.056*"monitor" + 0.054*"character" + ' '0.040*"section" + 0.039*"recommend" + 0.029*"usenet" + 0.028*"font" + ' '0.023*"workstation" + 0.020*"laboratory"'), (10, '0.095*"season" + 0.053*"pen" + 0.044*"trade" + 0.042*"objective" + ' '0.040*"rational" + 0.039*"star" + 0.030*"morality" + 0.030*"past" + ' '0.027*"predict" + 0.024*"penguin"'), (11, 
'0.045*"soldier" + 0.042*"armenian" + 0.040*"village" + 0.037*"greek" + ' '0.027*"turk" + 0.027*"turkish" + 0.025*"occupy" + 0.019*"terrorism" + ' '0.017*"northern" + 0.014*"inhabitant"'), (12, '0.053*"upgrade" + 0.047*"pack" + 0.043*"library" + 0.040*"dog" + ' '0.038*"status" + 0.034*"clock" + 0.028*"floppy" + 0.025*"electrical" + ' '0.025*"ftp_site" + 0.025*"routine"'), (13, '0.031*"write" + 0.022*"make" + 0.021*"know" + 0.021*"say" + 0.020*"think" + ' '0.020*"article" + 0.019*"people" + 0.015*"see" + 0.012*"thing" + ' '0.012*"way"'), (14, '0.092*"team" + 0.087*"game" + 0.061*"play" + 0.056*"win" + 0.044*"year" + ' '0.027*"division" + 0.023*"score" + 0.022*"wing" + 0.021*"fan" + ' '0.019*"run"'), (15, '0.070*"state" + 0.059*"government" + 0.055*"law" + 0.037*"right" + ' '0.022*"country" + 0.021*"protect" + 0.018*"pin" + 0.017*"crime" + ' '0.017*"watch" + 0.016*"citizen"'), (16, '0.047*"line" + 0.044*"get" + 0.034*"go" + 0.030*"nntp_poste" + ' '0.027*"organization" + 0.023*"host" + 0.021*"m" + 0.019*"good" + ' '0.015*"look" + 0.014*"time"'), (17, '0.060*"key" + 0.047*"system" + 0.034*"chip" + 0.030*"bit" + ' '0.029*"technology" + 0.023*"public" + 0.023*"phone" + 0.022*"datum" + ' '0.021*"cpu" + 0.018*"encryption"'), (18, '0.048*"problem" + 0.035*"use" + 0.015*"talk" + 0.014*"work" + 0.014*"high" ' '+ 0.014*"science" + 0.010*"set" + 0.010*"value" + 0.010*"current" + ' '0.010*"reference"'), (19, '0.044*"model" + 0.040*"device" + 0.036*"wire" + 0.033*"power" + ' '0.032*"replace" + 0.030*"bus" + 0.026*"unit" + 0.025*"internal" + ' '0.023*"ground" + 0.022*"external"')]
# Computing Perplexity
print('\nPerplexity Score: ', lda_model.log_perplexity(corpus))  # per-word likelihood bound (perplexity = 2 ** -bound); lower perplexity is better
# Computing Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Perplexity Score: -13.257142263819764 Coherence Score: 0.484063757142487
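A note on reading the number above: gensim's `log_perplexity` returns a per-word likelihood bound in log base 2, not the perplexity itself. A quick sketch of the conversion:

```python
# gensim's log_perplexity returns a per-word likelihood bound (log base 2);
# the actual perplexity is 2 ** (-bound), so a higher (less negative) bound
# corresponds to a lower perplexity and a better fit.
bound = -13.257142263819764   # value printed above
perplexity = 2 ** (-bound)
print(perplexity)             # on the order of 10**4 here
```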
# Visualizing the topics using pyLDAvis package's interactive chart
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
vis
My proposal for finding the optimal number of topics is to build many LDA models with different numbers of topics (k) and pick the one that gives the highest coherence score. Choosing that optimal ‘k’ usually offers meaningful and interpretable topics.
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary, random_state=100,
                                                update_every=1,
                                                chunksize=100,
                                                passes=10,
                                                alpha='auto',
                                                per_word_topics=True)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=2, limit=40, step=6)
# Plotting the graph for the purpose of choosing the optimal number of LDA topics
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.title("Choosing the Optimal Number of LDA Topics Based on the Coherence Score")
plt.xlabel("Number of Topics")
plt.ylabel("Coherence Score")
plt.legend(["coherence_values"], loc='best')
plt.show()
# Printing the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))
Num Topics = 2 has Coherence Value of 0.5698 Num Topics = 8 has Coherence Value of 0.5081 Num Topics = 14 has Coherence Value of 0.5075 Num Topics = 20 has Coherence Value of 0.4841 Num Topics = 26 has Coherence Value of 0.4575 Num Topics = 32 has Coherence Value of 0.4496 Num Topics = 38 has Coherence Value of 0.4384
According to the coherence graph and scores, the coherence declines after 14 topics, while between 8 and 14 topics it stays essentially flat. Based on that, I will choose the model with 8 topics for the purpose of optimizing the LDA model. The reason for choosing 8 topics is that when choosing a k that is too large (such as 14 or more topics), I saw the same keywords being repeated in multiple topics.
# Selecting the chosen LDA model and printing the topics
# (model_list[1] corresponds to num_topics=8, since k runs over 2, 8, 14, ...)
optimal_model = model_list[1]
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))
[(0, '0.015*"say" + 0.014*"people" + 0.011*"write" + 0.009*"think" + 0.009*"know" ' '+ 0.008*"make" + 0.007*"article" + 0.006*"believe" + 0.006*"see" + ' '0.006*"come"'), (1, '0.016*"key" + 0.014*"year" + 0.013*"team" + 0.013*"game" + 0.011*"line" + ' '0.009*"get" + 0.009*"play" + 0.009*"good" + 0.008*"go" + 0.007*"win"'), (2, '0.018*"law" + 0.014*"gun" + 0.014*"public" + 0.013*"government" + ' '0.012*"state" + 0.010*"right" + 0.009*"system" + 0.009*"science" + ' '0.009*"discussion" + 0.008*"case"'), (3, '0.017*"get" + 0.016*"article" + 0.016*"write" + 0.014*"line" + 0.012*"go" + ' '0.009*"organization" + 0.009*"m" + 0.009*"car" + 0.008*"good" + ' '0.007*"nntp_poste"'), (4, '0.018*"wire" + 0.015*"item" + 0.012*"steal" + 0.012*"clearly" + ' '0.011*"ground" + 0.010*"lead" + 0.010*"laugh" + 0.009*"cable" + 0.008*"gay" ' '+ 0.007*"motto"'), (5, '0.021*"line" + 0.012*"use" + 0.010*"system" + 0.010*"organization" + ' '0.009*"nntp_poste" + 0.008*"host" + 0.008*"thank" + 0.007*"drive" + ' '0.007*"get" + 0.007*"need"'), (6, '0.604*"ax" + 0.022*"_" + 0.019*"c" + 0.014*"pin" + 0.009*"gateway" + ' '0.008*"rlk" + 0.008*"cx" + 0.005*"ei" + 0.005*"sy" + 0.004*"mc"'), (7, '0.047*"space" + 0.023*"dn" + 0.017*"launch" + 0.016*"earth" + ' '0.016*"family" + 0.013*"orbit" + 0.013*"mission" + 0.011*"moon" + ' '0.010*"satellite" + 0.009*"flight"')]
The topic keywords alone may not be enough to make sense of what a topic is about. So, to help with interpreting each topic, I will find the document to which a given topic has contributed the most and infer the topic by reading that document.
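The dataframe `df_topic_sents_keywords` used below is built in a cell not shown here. A minimal sketch of how it could be constructed (the helper name `format_topics_sentences` and its exact layout are my assumption; it relies on gensim's `get_document_topics` and `show_topic`):

```python
import pandas as pd

def format_topics_sentences(ldamodel, corpus, texts):
    """Find each document's dominant topic, its contribution, and the topic keywords."""
    rows = []
    for i, bow in enumerate(corpus):
        # Topic distribution for this document, sorted by descending probability
        topic_dist = sorted(ldamodel.get_document_topics(bow), key=lambda x: x[1], reverse=True)
        if not topic_dist:
            continue
        dominant_topic, prop = topic_dist[0]
        keywords = ", ".join(word for word, _ in ldamodel.show_topic(dominant_topic))
        rows.append([int(dominant_topic), round(prop, 4), keywords, texts[i]])
    return pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution',
                                       'Topic_Keywords', 'Text'])

# df_topic_sents_keywords = format_topics_sentences(optimal_model, corpus, data)
```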
# Selecting the most representative document for each topic
sent_topics_sorteddf = pd.DataFrame()
sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')
for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf = pd.concat([sent_topics_sorteddf,
                                      grp.sort_values(['Perc_Contribution'], ascending=[0]).head(1)],
                                     axis=0)
# Resetting the Index
sent_topics_sorteddf.reset_index(drop=True, inplace=True)
# Formatting
sent_topics_sorteddf.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]
# Showing the final table
sent_topics_sorteddf.head(8)
| | Topic_Num | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|
| 0 | 0 | 0.9964 | go, say, people, get, know, think, gun, time, ... | Organization: University of Illinois at Chicag... |
| 1 | 1 | 0.9953 | drive, scsi, chip, line, bit, write, speed, fa... | From: (GRUBB) Subject: Re: IDE vs SCSI Organiz... |
| 2 | 2 | 0.9873 | write, line, bike, article, car, organization,... | From: (Beverly M. Zalan) Subject: Re: Frequent... |
| 3 | 3 | 0.9971 | year, team, line, game, go, write, get, articl... | From: (peter.r.clark..jr) Subject: Re: Flyers ... |
| 4 | 4 | 0.9999 | ax, rlk, _, ei, m, qax, rk, r, cj, bf | Subject: roman.bmp 07/14 From: (Cliff) Reply-T... |
| 5 | 5 | 0.9952 | key, encryption, use, ripem, line, government,... | Subject: text of White House announcement and ... |
| 6 | 6 | 0.9967 | line, space, image, program, use, work, also, ... | From: (Stephen D Brener) Subject: Intensive Ja... |
| 7 | 7 | 0.9943 | write, line, article, israeli, armenian, attac... | From: (Adam Shostack) Subject: Re: was:Go Hezb... |
And the final step for this LDA model is understanding the volume and distribution of topics, in order to judge how widely each topic was discussed.
# The Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()
# The Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)
# The Topic Number and Keywords
topic_num_keywords = df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']]
# Concatenating Column wise
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)
# Changing the Column names
df_dominant_topics.columns = ['Dominant_Topic', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']
# Showing the final table
df_dominant_topics.head()
| | Dominant_Topic | Topic_Keywords | Num_Documents | Perc_Documents |
|---|---|---|---|---|
| 0 | 8 | line, write, get, article, nntp_poste, organiz... | 1316.0 | 0.1163 |
| 1 | 1 | drive, scsi, chip, line, bit, write, speed, fa... | 287.0 | 0.0254 |
| 2 | 8 | line, write, get, article, nntp_poste, organiz... | 355.0 | 0.0314 |
| 3 | 8 | line, write, get, article, nntp_poste, organiz... | 1329.0 | 0.1175 |
| 4 | 11 | line, file, get, write, window, use, program, ... | 16.0 | 0.0014 |
Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), is a natural language processing method that analyzes relationships between a set of documents and the terms they contain. It uses singular value decomposition (SVD), a matrix factorization technique, to uncover hidden relationships between terms and concepts in unstructured data.
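To make the SVD step concrete, here is a tiny, self-contained sketch of the idea (the 4-term by 3-document count matrix is made up for illustration, not taken from the corpus): truncating to k = 2 latent "concepts", which is what `LsiModel` does at scale.

```python
import numpy as np

# A made-up 4-term x 3-document count matrix (rows = terms, columns = documents)
A = np.array([[2., 0., 1.],
              [1., 0., 0.],
              [0., 3., 1.],
              [0., 1., 2.]])

# Full SVD, then keep only the k strongest singular values/vectors
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-2 approximation of A

# Each column of Vt[:k, :] places a document in the k-dimensional concept space
print(A_k.shape)   # (4, 3)
```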
All the preprocessing work done on the 20 newsgroups dataset is still valid here. So, I can continue straight to the LSA model.
Again, I can obtain the coherence score with the Gensim module. Let’s compute the coherence score for an LSA model with 20 topics (the same number of topics I initially chose for the LDA model, for comparison purposes).
Note - LsiModel does not support log_perplexity for computing a perplexity score the way LdaModel does. So, I will drop the perplexity score and focus my attention only on the coherence score.
lsi = LsiModel(corpus, num_topics=20, id2word=id2word, chunksize=100)
# Computing Coherence Score
coherence_model_lsi = CoherenceModel(model=lsi, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lsi = coherence_model_lsi.get_coherence()
print('\nCoherence Score: ', coherence_lsi)
Coherence Score: 0.5474654106268115
Now, let’s see the coherence score for the LSA model over a range of 2 to 20 topics. The reason for doing this, as before with the LDA model, is to choose the optimal number of topics and obtain a more “polished” topic model that describes the corpus of documents more coherently.
# Finding the coherence score with a different number of topics
for i in range(2, 21):
    lsi = LsiModel(corpus, num_topics=i, id2word=id2word)
    coherence_model = CoherenceModel(model=lsi, texts=texts, dictionary=id2word, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    print('Coherence score with {} clusters: {}'.format(i, coherence_score))
Coherence score with 2 clusters: 0.5474654106268115 Coherence score with 3 clusters: 0.5804817573064794 Coherence score with 4 clusters: 0.5601378163216414 Coherence score with 5 clusters: 0.5978660958341103 Coherence score with 6 clusters: 0.6168266601159772 Coherence score with 7 clusters: 0.5934774020945769 Coherence score with 8 clusters: 0.5181731035330599 Coherence score with 9 clusters: 0.5397277320313394 Coherence score with 10 clusters: 0.5211226597237258 Coherence score with 11 clusters: 0.5158202036881903 Coherence score with 12 clusters: 0.5357748057219601 Coherence score with 13 clusters: 0.5211371341484472 Coherence score with 14 clusters: 0.4830237564401526 Coherence score with 15 clusters: 0.463579838268446 Coherence score with 16 clusters: 0.47692082683039305 Coherence score with 17 clusters: 0.48779930823677753 Coherence score with 18 clusters: 0.4801059780572923 Coherence score with 19 clusters: 0.4843422484510121 Coherence score with 20 clusters: 0.45692628070938657
According to the coherence scores, after 6 topics the coherence score decreases. Based on that, I will choose the model with 6 topics for the purpose of optimizing the LSA model. The reason for choosing 6 topics is that choosing a ‘k’ that marks the end of rapid growth in topic coherence usually offers meaningful and interpretable topics.
# Performing SVD on the bag of words with the LsiModel to extract 6 topics
lsi = LsiModel(corpus, num_topics=6, id2word=id2word)
# Finding the 10 words with the strongest association to each derived topic
for topic_num, words in lsi.print_topics(num_words=10):
    print('Words in {}: {}.'.format(topic_num, words))
Words in 0: 1.000*"ax" + 0.001*"qax" + 0.001*"m" + 0.001*"giz" + 0.001*"ei" + 0.001*"bhj_bhj" + 0.001*"giz_giz" + 0.000*"mf" + 0.000*"tq" + 0.000*"bhj_giz". Words in 1: 0.243*"say" + 0.199*"file" + 0.197*"go" + 0.179*"get" + 0.168*"people" + 0.166*"know" + 0.144*"make" + 0.135*"see" + 0.132*"use" + 0.129*"also". Words in 2: 0.409*"file" + -0.336*"say" + -0.251*"go" + 0.167*"image" + 0.159*"program" + -0.159*"know" + -0.158*"people" + -0.139*"think" + -0.137*"s" + -0.136*"come". Words in 3: -0.581*"file" + -0.331*"entry" + 0.172*"system" + -0.135*"say" + 0.123*"use" + 0.122*"available" + -0.108*"output" + 0.107*"also" + -0.093*"program" + -0.092*"gun". Words in 4: -0.382*"image" + 0.195*"privacy" + 0.182*"internet" + -0.153*"color" + 0.139*"anonymous" + -0.138*"format" + -0.135*"say" + -0.135*"available" + -0.133*"go" + -0.131*"version". Words in 5: -0.302*"wire" + -0.222*"entry" + 0.200*"internet" + -0.190*"wiring" + 0.181*"privacy" + -0.172*"circuit" + -0.147*"ground" + 0.141*"file" + -0.131*"outlet" + 0.128*"anonymous".
Earlier in the project I preprocessed the dataset into a final lemmatized form (named: data_lemmatized) containing the words I have been working with. Now, I want to add those words to my dataframe as a new column of rows with their corresponding words. After that, I would like to convert those words back into sentences for the purpose of using a sentence-transformer model with BERTopic.
# Adding new column to the dataframe (named: text_cleaned)
# containing the different lemmatized words in each corresponding row.
df['text_cleaned'] = data_lemmatized
# Function to turn the token lists back into sentences
def make_sentences(data, name):
    data[name] = data[name].apply(lambda x: ' '.join(x))
    # Normalizing whitespace in case double spaces were created
    data[name] = data[name].apply(lambda x: re.sub(r'\s+', ' ', x, flags=re.I))
# Converting all the texts back to sentences
make_sentences(df, 'text_cleaned')
df.head()
| | content | target | target_names | text_cleaned |
|---|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos | s thing car nntp_poste host park line wonder e... |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware | clock poll final call summary final call si cl... |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware | question organization purdue_university engine... |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | comp.graphics | system division line nntp_poste host version_p... |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space | question organization line article pack rat wr... |
# Loading a pretrained sentence-transformer model and embedding the cleaned texts
model = SentenceTransformer('all-MiniLM-L12-v2')
embeddings = model.encode(df['text_cleaned'])
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. The K-means algorithm identifies k centroids and allocates every data point to its nearest centroid, keeping the clusters as compact as possible (i.e., minimizing the within-cluster sum of squared distances).
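A minimal sketch of that assign-and-update loop on toy 2-D data (the blob data and the fixed initialization are made up for illustration; the notebook itself relies on scikit-learn's MiniBatchKMeans):

```python
import numpy as np

# Two well-separated toy blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(3.0, 0.3, (20, 2))])

# Start with one point from each blob as the initial centroids
centroids = np.array([X[0], X[20]])
for _ in range(10):
    # Assignment step: each point goes to its nearest centroid
    labels = np.argmin(((X[:, None, :] - centroids) ** 2).sum(-1), axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
```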
def find_optimal_clusters(data, max_k):
    iters = range(2, max_k + 1, 1)
    sse = []
    for k in iters:
        sse.append(MiniBatchKMeans(n_clusters=k, init_size=256, batch_size=512, random_state=20).fit(data).inertia_)
        print('Fit {} clusters'.format(k))
    f, ax = plt.subplots(1, 1)
    ax.plot(iters, sse, marker='o')
    ax.set_xlabel('Cluster Centers')
    ax.set_xticks(iters)
    ax.set_xticklabels(iters)
    ax.set_ylabel('SSE')
    ax.set_title('SSE by Cluster Center Plot')
find_optimal_clusters(embeddings, 20)
Fit 2 clusters Fit 3 clusters Fit 4 clusters Fit 5 clusters Fit 6 clusters Fit 7 clusters Fit 8 clusters Fit 9 clusters Fit 10 clusters Fit 11 clusters Fit 12 clusters Fit 13 clusters Fit 14 clusters Fit 15 clusters Fit 16 clusters Fit 17 clusters Fit 18 clusters Fit 19 clusters Fit 20 clusters
According to the plot, the steepest drop in SSE occurs between 2 and 3 clusters, which suggests a small optimal cluster count. I am going to try both 2 and 3 clusters.
# Beginning with 2 clusters
clusters_2 = MiniBatchKMeans(n_clusters=2, init_size=1024, batch_size=2048, random_state=20).fit_predict(embeddings)
# Defining a function for visualizing the clusters after dimensionality reduction with different techniques
def plot_tsne_pca_umap(data, labels):
    max_label = max(labels) + 1
    max_items = np.random.choice(range(data.shape[0]), size=3000, replace=False)
    reducer = umap.UMAP()
    pca = PCA(n_components=2).fit_transform(data[max_items, :])
    tsne = TSNE().fit_transform(PCA(n_components=50).fit_transform(data[max_items, :]))
    uma = reducer.fit_transform(PCA(n_components=50).fit_transform(data[max_items, :]))
    idx = np.random.choice(range(pca.shape[0]), size=320, replace=False)
    label_subset = labels[max_items]
    label_subset = [cm.hsv(i / max_label) for i in label_subset[idx]]
    f, ax = plt.subplots(1, 3, figsize=(14, 6))
    ax[0].scatter(pca[idx, 0], pca[idx, 1], c=label_subset)
    ax[0].set_title('PCA Cluster Plot')
    ax[1].scatter(tsne[idx, 0], tsne[idx, 1], c=label_subset)
    ax[1].set_title('TSNE Cluster Plot')
    ax[2].scatter(uma[idx, 0], uma[idx, 1], c=label_subset)
    ax[2].set_title('UMAP Cluster Plot')
plot_tsne_pca_umap(embeddings, clusters_2)
We can see three dimensionality reduction techniques side by side: PCA, TSNE, and UMAP (the technique BERTopic uses internally), all colored by the 2 clusters found by the k-means algorithm. Note that TSNE and UMAP here operate on a 50-component PCA projection of the embeddings, and both separate the two clusters clearly.
# Moving to 3 clusters
clusters_3 = MiniBatchKMeans(n_clusters=3, init_size=1024, batch_size=2048, random_state=20).fit_predict(embeddings)
plot_tsne_pca_umap(embeddings, clusters_3)
The main difference between TSNE and UMAP lies in how the distance between objects or "clusters" should be interpreted.
TSNE preserves only the local structure of the data.
UMAP claims to preserve both the local and most of the global structure of the data. UMAP is also considerably faster than TSNE, especially on large, high-dimensional datasets.
# Fitting BERTopic on the cleaned text, reusing the precomputed sentence embeddings
model2 = BERTopic()
topics, probabilities = model2.fit_transform(df['text_cleaned'], embeddings)
# viewing how frequent certain topics are
model2.get_topic_freq().head()
|   | Topic | Count |
|---|---|---|
| 0 | -1 | 3614 |
| 1 | 0 | 1113 |
| 2 | 1 | 544 |
| 3 | 2 | 452 |
| 4 | 3 | 414 |
Topic -1 collects all documents that were not assigned to any topic. BERTopic does not force every document into a cluster; if no cluster can be found for a document, it is simply treated as an outlier.
After generating the topics and their probabilities, I can inspect the most frequent topics.
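The frequency table above can be reproduced from the raw `topics` list alone. A minimal sketch with a hypothetical toy assignment (the real `topics` comes from `fit_transform`):

```python
from collections import Counter

# Toy topic assignments; -1 marks outlier documents (hypothetical data)
topics_demo = [-1, 0, 0, 1, -1, 2, 0, -1, 1, -1]

freq = Counter(topics_demo)
# Sort by count, descending -- the same ordering get_topic_freq() shows
for topic, count in freq.most_common():
    print(topic, count)

n_outliers = freq[-1]
print(f"{n_outliers} of {len(topics_demo)} documents are outliers")
```

As in the real run, the outlier "topic" -1 can easily end up as the largest group.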
model2.get_topic(0)
[('team', 0.02531838599464123),
('game', 0.02438041078399716),
('player', 0.019756747153850802),
('play', 0.01716978232301229),
('season', 0.015572793356900618),
('hockey', 0.01301623122543666),
('win', 0.012775921704569709),
('year', 0.012687353632565911),
('nhl', 0.011446904475975189),
('score', 0.011411865794654865)]
I can infer from the keywords that the topic discussed in these documents relates to SPORTS.
model2.get_topic(1)
[('space', 0.025939969611006874),
('launch', 0.017670839520851994),
('satellite', 0.013311532375266005),
('orbit', 0.012928079366789044),
('mission', 0.012221057412043329),
('earth', 0.010954313253211286),
('moon', 0.009484047344214675),
('rocket', 0.009068061670905783),
('flight', 0.008840779966201714),
('spacecraft', 0.008528870856080634)]
I can infer from the keywords that the topic discussed in these documents relates to SPACE.
model2.get_topic(2)
[('car', 0.04418850328097452),
('engine', 0.014353478445618184),
('brake', 0.01240116371108509),
('drive', 0.011118568363411532),
('speed', 0.009631037912160334),
('tire', 0.009569073830006452),
('dealer', 0.00923751958500101),
('price', 0.009200924817820054),
('saturn', 0.008951214922059298),
('road', 0.008650953035079923)]
I can infer from the keywords that the topic discussed in these documents relates to AUTOMOBILES.
model2.get_topic(3)
[('key', 0.027696690150910704),
('encryption', 0.021039274687840403),
('entry', 0.014423183320468415),
('privacy', 0.014213320141352043),
('clipperchip', 0.01274244153028646),
('security', 0.01225822027176883),
('chip', 0.011385142823592757),
('clipper', 0.01099049829575111),
('secure', 0.010447401660497797),
('file', 0.0099918349958672)]
I can infer from the keywords that the topic discussed in these documents relates to NETWORK/CYBER SECURITY.
model2.get_topic(4)
[('amp', 0.01933858776121388),
('audio', 0.014854640508848615),
('battery', 0.01454935142035318),
('sound', 0.013949004642053859),
('circuit', 0.013304362047248004),
('input', 0.013094917362452984),
('channel', 0.012716265382296114),
('stereo', 0.011944914740315237),
('output', 0.011420517863627255),
('voltage', 0.010665935415603046)]
I can infer from the keywords that the topic discussed in these documents relates to SOUND SYSTEMS.
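The manual labelling above can be roughly automated by joining the top few keywords of each topic into a label. A minimal sketch over the (word, score) pairs that `get_topic` returns, using a shortened copy of topic 1's output as illustrative data:

```python
def label_topic(topic_words, n=3):
    """Build a crude human-readable label from the n highest-scoring words.
    `topic_words` is a list of (word, score) pairs, as returned by get_topic."""
    top = sorted(topic_words, key=lambda ws: ws[1], reverse=True)[:n]
    return "_".join(word for word, _ in top)

# First entries of topic 1 from the output above (scores truncated)
topic_1 = [('space', 0.0259), ('launch', 0.0177), ('satellite', 0.0133),
           ('orbit', 0.0129), ('mission', 0.0122)]
print(label_topic(topic_1))  # -> space_launch_satellite
```

Such keyword-concatenation labels are cruder than the human-chosen names (SPORTS, SPACE, ...), but they scale to all topics at once.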
model2.get_topics()
{-1: [('ax', 0.020173035305477125),
('line', 0.005284320735989547),
('write', 0.004815641311499644),
('say', 0.004811078449112976),
('know', 0.004676313617227577),
('get', 0.004382500457664024),
('article', 0.004208136192195585),
('organization', 0.004207788605474893),
('nntpposte', 0.004063237701209443),
('people', 0.004057411787023439)],
0: [('team', 0.02531838599464123),
('game', 0.02438041078399716),
('player', 0.019756747153850802),
('play', 0.01716978232301229),
('season', 0.015572793356900618),
('hockey', 0.01301623122543666),
('win', 0.012775921704569709),
('year', 0.012687353632565911),
('nhl', 0.011446904475975189),
('score', 0.011411865794654865)],
1: [('space', 0.025939969611006874),
('launch', 0.017670839520851994),
('satellite', 0.013311532375266005),
('orbit', 0.012928079366789044),
('mission', 0.012221057412043329),
('earth', 0.010954313253211286),
('moon', 0.009484047344214675),
('rocket', 0.009068061670905783),
('flight', 0.008840779966201714),
('spacecraft', 0.008528870856080634)],
2: [('car', 0.04418850328097452),
('engine', 0.014353478445618184),
('brake', 0.01240116371108509),
('drive', 0.011118568363411532),
('speed', 0.009631037912160334),
('tire', 0.009569073830006452),
('dealer', 0.00923751958500101),
('price', 0.009200924817820054),
('saturn', 0.008951214922059298),
('road', 0.008650953035079923)],
3: [('key', 0.027696690150910704),
('encryption', 0.021039274687840403),
('entry', 0.014423183320468415),
('privacy', 0.014213320141352043),
('clipperchip', 0.01274244153028646),
('security', 0.01225822027176883),
('chip', 0.011385142823592757),
('clipper', 0.01099049829575111),
('secure', 0.010447401660497797),
('file', 0.0099918349958672)],
4: [('amp', 0.01933858776121388),
('audio', 0.014854640508848615),
('battery', 0.01454935142035318),
('sound', 0.013949004642053859),
('circuit', 0.013304362047248004),
('input', 0.013094917362452984),
('channel', 0.012716265382296114),
('stereo', 0.011944914740315237),
('output', 0.011420517863627255),
('voltage', 0.010665935415603046)],
5: [('gun', 0.03860060701032071),
('firearm', 0.022079495650725138),
('weapon', 0.01604032675937555),
('handgun', 0.014248186038456057),
('guncontrol', 0.013204935795688945),
('crime', 0.012231098238337441),
('militia', 0.010754490161962898),
('criminal', 0.010442234463384973),
('right', 0.009939366728343004),
('state', 0.009511172709302112)],
6: [('israeli', 0.03498161655206188),
('arab', 0.02333324546867351),
('attack', 0.014498670170942956),
('palestinian', 0.013823798524660083),
('lebanese', 0.013076784284039937),
('soldier', 0.01083131119167203),
('civilian', 0.010717474121092729),
('peace', 0.010647873407253216),
('village', 0.010562804028671314),
('policyresearch', 0.010397257115494657)],
7: [('mail', 0.05859183838406447),
('address', 0.039653532079916363),
('nntpposte', 0.028926860579300617),
('fax', 0.028072453990112805),
('line', 0.023905474118807283),
('thank', 0.023905446997748183),
('host', 0.023893325732244605),
('email', 0.02022815593741024),
('internet', 0.019417427384648196),
('send', 0.01690817064843392)],
8: [('bike', 0.06496351838366486),
('motorcycle', 0.04161987326812391),
('ride', 0.033339920691596574),
('denizen', 0.01644598650813227),
('rider', 0.015606969427841228),
('dod', 0.014439237866022461),
('advice', 0.014260658573921633),
('recmotorcycle', 0.01101622498659433),
('mile', 0.010565035057966837),
('list', 0.010431387593701382)],
9: [('printer', 0.11357426897096855),
('print', 0.05702196762028691),
('ink', 0.037670681513257344),
('bubblejet', 0.03399936646322734),
('deskjet', 0.030223464596539372),
('postscript', 0.022068532500621837),
('font', 0.021925969370348925),
('toner', 0.02190114915319171),
('scanner', 0.021497118112210577),
('laserprinter', 0.02041586291696539)],
10: [('moral', 0.0562879140528236),
('morality', 0.04937240319191626),
('objective', 0.039358861754562526),
('objectivemorality', 0.02382439790274925),
('value', 0.01926253232906652),
('animal', 0.0170335317833117),
('specie', 0.016022288097005285),
('immoral', 0.015980602817449615),
('frankodwyer', 0.01587247291547734),
('objectivevalue', 0.014788685291665426)],
11: [('sale', 0.04421519150451173),
('ticket', 0.03560509438240559),
('hotel', 0.025320778748856874),
('offer', 0.020792391842287098),
('up', 0.019649145319916045),
('sell', 0.019412553408750654),
('mail', 0.017011192683458447),
('line', 0.01610022869001741),
('nntpposte', 0.015731871940311417),
('host', 0.015185561743365012)],
12: [('tap', 0.03852157841509703),
('government', 0.017165168354659616),
('police', 0.016459260752552853),
('key', 0.013102565060776318),
('trust', 0.012262961253193198),
('proposal', 0.010772005230080138),
('cop', 0.0107508962367134),
('good', 0.01013971052265792),
('clipper', 0.010081199103877435),
('wiretap', 0.009751336729956555)],
13: [('polygon', 0.07664445916596928),
('point', 0.03893674489465752),
('sphere', 0.038081077837062834),
('plane', 0.02962244961012423),
('edge', 0.025271756955666896),
('routine', 0.024233563114594504),
('surface', 0.024054673537049768),
('algorithm', 0.022112335540845445),
('circle', 0.021977278212372386),
('intersection', 0.0209909259557251)],
14: [('atheist', 0.04893884517223564),
('atheism', 0.04241221430090983),
('exist', 0.03061628445578153),
('belief', 0.019954752722363665),
('existence', 0.019832177993431413),
('argument', 0.018903613951371356),
('fallacy', 0.018257546718925766),
('religion', 0.01781803876424503),
('believe', 0.01705383473492208),
('theist', 0.015881311295293535)],
15: [('sale', 0.02481006953286225),
('price', 0.023841358770878323),
('software', 0.020881658804993826),
('computer', 0.016678326700701854),
('manual', 0.015688282978400248),
('apple', 0.015344637470856386),
('upgrade', 0.014460154254993769),
('list', 0.013776007887434528),
('disk', 0.013642061298940215),
('fpu', 0.01313482395317774)],
16: [('armenian', 0.05908545571010144),
('turkish', 0.03232436035834615),
('genocide', 0.02218832083890312),
('turk', 0.021737044775558313),
('serdarargic', 0.020108724995079718),
('massacre', 0.0131087001504206),
('escape', 0.012550743122526149),
('nazi', 0.012147942890944426),
('village', 0.011953088688120141),
('russian', 0.011040304622265906)],
17: [('color', 0.08862597110490583),
('colormap', 0.06300349688874046),
('bit', 0.036821680682162496),
('visual', 0.03586169745192511),
('standardcolormap', 0.027231051529045616),
('depth', 0.025779459923563328),
('client', 0.02311747151849819),
('display', 0.02237733488102167),
('colour', 0.02097663016118978),
('screen', 0.019929988410128437)],
18: [('food', 0.06346362245983232),
('msg', 0.053809079103557166),
('superstition', 0.03968808184006058),
('glutamate', 0.029865678433939512),
('taste', 0.028976187613034984),
('reaction', 0.02509663095018179),
('msgsensitivity', 0.02268231828920453),
('effect', 0.020898428274333546),
('chineserestaurant', 0.020202126349109253),
('study', 0.01839592003763124)],
19: [('muslim', 0.027038665821038622),
('islamic', 0.02189971211644326),
('sex', 0.02138244676598906),
('rushdie', 0.021336207821779942),
('religion', 0.017249676003408942),
('gregg', 0.01660662658774099),
('woman', 0.016372612029085912),
('greggjaeger', 0.0151479230367414),
('depression', 0.01484043269589068),
('marriage', 0.014283669741626343)],
20: [('sin', 0.03954147651400956),
('faith', 0.023837032523959863),
('prayer', 0.02337432034404498),
('love', 0.0233336933555494),
('salvation', 0.02002137922331453),
('god', 0.016535365656587368),
('commandment', 0.016302953911367178),
('christian', 0.013930437704047692),
('man', 0.012955090033372234),
('aid', 0.012928529000638221)],
21: [('cx', 0.05832370768176702),
('sc', 0.049281712607800954),
('rlk', 0.04530922098345502),
('scx', 0.04037756188750717),
('format', 0.03610419888067061),
('sy', 0.03508745798357945),
('file', 0.03414332774952954),
('cj', 0.029244421971362275),
('image', 0.028151439573020434),
('cxs', 0.02523424370099946)],
22: [('drive', 0.09344372848175847),
('boot', 0.04595409545372484),
('rombio', 0.03874923347123469),
('disk', 0.03424902237047937),
('harddisk', 0.03423138946956143),
('feature', 0.02658110759809402),
('controller', 0.02615418072733169),
('system', 0.022900671460440166),
('bio', 0.022485170835851626),
('westerndigital', 0.020802401740781223)],
23: [('monitor', 0.11790037508292904),
('vga', 0.042051746246658576),
('vgamonitor', 0.031447296516934946),
('video', 0.022170960227908366),
('card', 0.020617569182584918),
('resolution', 0.020038556115396487),
('viewsonic', 0.018807589178834218),
('necfg', 0.017871128453768072),
('mode', 0.017082381048776945),
('tube', 0.01595638400144289)],
24: [('libertarian', 0.03723633120642859),
('government', 0.03266469196975339),
('stevehendrick', 0.024651945300169297),
('employment', 0.013610441616640392),
('libertarianism', 0.01338696975840834),
('regulation', 0.012171652856425645),
('economy', 0.01196064469508917),
('socialism', 0.011890445807076263),
('welfare', 0.011011468313285774),
('country', 0.01098086900390882)],
25: [('mhz', 0.07099157056321548),
('clock', 0.05160925998636245),
('processor', 0.04430575682151527),
('speed', 0.039090681455036164),
('pentium', 0.037424422958120616),
('cpu', 0.031423689658895755),
('instruction', 0.029018939111080833),
('performance', 0.025472325444006693),
('cisc', 0.024020501912967167),
('architecture', 0.023503914782384446)],
26: [('church', 0.047154297017551694),
('pope', 0.02880190853941829),
('catholic', 0.026000488476967155),
('doctrine', 0.01887174743664526),
('schism', 0.01858300047898671),
('revelation', 0.015534341921285682),
('bishop', 0.01409634811141122),
('sin', 0.014042103033983134),
('schismatic', 0.01312873659161141),
('trinity', 0.012864148429777744)],
27: [('nntpposte', 0.022755598467944557),
('host', 0.021176879195538328),
('edmccreary', 0.021115493697488413),
('robertweiss', 0.018321189768368),
('write', 0.017231737608341697),
('sequel', 0.01626017202090918),
('schiewer', 0.01591691245512373),
('rossborden', 0.01591691245512373),
('billconner', 0.015758619445428508),
('organization', 0.015378072101627381)],
28: [('science', 0.033076526712751235),
('contradictory', 0.022663375356933465),
('universe', 0.01875060722802443),
('god', 0.018068516469823343),
('exist', 0.017663298968191594),
('origin', 0.017271037445979874),
('language', 0.013989916165823691),
('description', 0.013054099795900689),
('false', 0.012499045110776865),
('say', 0.012314570332154634)],
29: [('survivor', 0.06208928932374977),
('dividianranch', 0.06092009314413341),
('atfburn', 0.055138755384206505),
('fire', 0.033817185835534475),
('atf', 0.024466796691918997),
('stove', 0.02281301663399489),
('napalm', 0.022352027414002062),
('woodstove', 0.019934146508674873),
('never', 0.019261053863044324),
('insideignite', 0.018709015639161566)],
30: [('fire', 0.028961967979713576),
('compound', 0.02719764182424098),
('scottroby', 0.024573529737630752),
('child', 0.022306471874888475),
('tearga', 0.018088590972270738),
('murdersalmost', 0.017470221562138167),
('batf', 0.01686957221810202),
('koresh', 0.016822674139829536),
('agent', 0.013025369353435371),
('affair', 0.012771519232819505)],
31: [('scsi', 0.21369832339677117),
('ide', 0.04499483387543342),
('mb', 0.043485931106553175),
('drive', 0.04244432841003108),
('device', 0.032370708308683535),
('esdi', 0.02844947732550255),
('fast', 0.0280105910197716),
('interface', 0.0270831224759238),
('pc', 0.024651399926943367),
('transfer', 0.02269526284693792)],
32: [('rock', 0.04170414651532044),
('kid', 0.03432825336725548),
('warning', 0.030270818333784644),
('overpass', 0.028273063823245258),
('car', 0.026368547139087272),
('teenager', 0.022725033585320197),
('read', 0.017638208635995623),
('kill', 0.015844630362782528),
('bridge', 0.015414211276658917),
('keywordsbrick', 0.01522975379895767)],
33: [('insurance', 0.0764419593911723),
('fault', 0.031912836910183394),
('deductible', 0.02941388019849342),
('car', 0.028262795427504157),
('pay', 0.025823352308636256),
('accident', 0.023572205594853953),
('rate', 0.02236804343299877),
('sticker', 0.021661316738831073),
('company', 0.01923565270501552),
('farm', 0.01845286404333179)],
34: [('card', 0.12265783967540461),
('color', 0.029462904250679323),
('vram', 0.027648078694579366),
('fast', 0.02615799937223588),
('bit', 0.025639851222644262),
('graphic', 0.024929280624347194),
('video', 0.023467914529196582),
('orchid', 0.021502900244684556),
('monitor', 0.02088095686756272),
('performance', 0.020741545503941257)],
35: [('gordonbank', 0.06075334932347258),
('jxpskepticism', 0.03804024234550143),
('shameful', 0.0377487688242111),
('transplant', 0.03771163897182705),
('liver', 0.03771163897182705),
('intellect', 0.037196092832077685),
('surrender', 0.03419647034628476),
('pain', 0.032307976060557235),
('computerscience', 0.027388187021648438),
('soon', 0.02621621299610939)],
36: [('countersteere', 0.0546877433841349),
('bike', 0.04877682032873622),
('rider', 0.038299126675393064),
('motorcycle', 0.03067092057082276),
('rein', 0.029006019772890773),
('steer', 0.028805738148888215),
('lean', 0.02791722279023156),
('turn', 0.026180429004502258),
('swerve', 0.02394755006775863),
('technique', 0.02363282505561421)],
37: [('post', 0.019914128137724876),
('funny', 0.01956202537267302),
('article', 0.01176354021736683),
('joke', 0.011477582915284284),
('write', 0.01022437894740424),
('pray', 0.010223770705213018),
('day', 0.009934511768020602),
('lord', 0.009747242912610272),
('opinion', 0.009578837151833742),
('naive', 0.009268271712870679)],
38: [('yeast', 0.0450882884467131),
('nystatin', 0.029603291817926487),
('sinus', 0.0276589295888865),
('infection', 0.026219878231877793),
('treatment', 0.026079565964799923),
('antibiotic', 0.022706408336135315),
('symptom', 0.022573598682208276),
('oily', 0.021939991749768854),
('acne', 0.020484623818328385),
('quack', 0.01946874304104085)],
39: [('cruel', 0.05441532740739915),
('deathpenalty', 0.044730854591394247),
('innocent', 0.042274675563293125),
('murder', 0.037619971493148996),
('kill', 0.03392094559051295),
('punishment', 0.03184924710187425),
('politicalatheist', 0.031399768119205404),
('commit', 0.022521894329948895),
('system', 0.02128501542026509),
('execute', 0.021268459329452434)],
40: [('greek', 0.08020641653573837),
('turkish', 0.03369632413144676),
('turk', 0.03269440476658686),
('greece', 0.028928514657449222),
('turkishminority', 0.017408874429658126),
('ethnic', 0.014070776009280528),
('government', 0.013156072280788028),
('minority', 0.012286470353335341),
('armenian', 0.011732204574380851),
('book', 0.010918609256852186)],
41: [('christian', 0.028394179464941192),
('liarlunatic', 0.020646137204369155),
('liar', 0.018112863391980438),
('die', 0.017753213265782383),
('religion', 0.017271035186176904),
('people', 0.016419194728729356),
('prophecy', 0.016180596459817377),
('heal', 0.014959658796960899),
('christianity', 0.014270679449012029),
('bible', 0.013657745636262403)],
42: [('science', 0.05410192624260946),
('methodology', 0.038157365731421235),
('scientific', 0.028327819447902473),
('hypothesis', 0.02383097150484523),
('theory', 0.02219755626013905),
('experiment', 0.02002323269457525),
('sequence', 0.018712247402045572),
('fantasy', 0.01838389230373775),
('homeopathytradition', 0.017121392122000326),
('protein', 0.016970670662207647)],
43: [('keyboard', 0.07167842682785401),
('key', 0.06853053938161754),
('accelerator', 0.059373708257558035),
('shift', 0.03731161727041526),
('modifier', 0.03168700185618704),
('ctrl', 0.027729619834106985),
('translation', 0.027684760231644637),
('ctrlkey', 0.026232417694033425),
('define', 0.02220227625785368),
('menu', 0.021250011032195728)],
44: [('msmyer', 0.030126179171653535),
('president', 0.02858272403472137),
('job', 0.02243011759524676),
('work', 0.014895331907671732),
('russian', 0.014188154049630732),
('senioradministration', 0.014040473766190413),
('go', 0.013976089056400841),
('think', 0.013275827602901066),
('package', 0.013017902260384111),
('official', 0.012623735448998677)],
45: [('absolute', 0.055735603946341944),
('truth', 0.04927651335520828),
('arrogance', 0.03242650672704986),
('belief', 0.031229075530699573),
('arrogant', 0.02572439095961214),
('believe', 0.023397014686843617),
('authority', 0.02287105825546018),
('scripture', 0.021087703793337746),
('absolutetruth', 0.020851625308633732),
('evidence', 0.01896131869983727)],
46: [('drug', 0.14797412998928966),
('legalization', 0.031063865034182004),
('legalize', 0.03079586791079164),
('war', 0.024373591850061393),
('cocaine', 0.02319757444457153),
('wod', 0.022912426135865943),
('hypocrisyt', 0.022593196407818015),
('cigarette', 0.021073318320382468),
('ryanscharfy', 0.02085791894702093),
('legal', 0.019062508475203185)],
47: [('oil', 0.1298915250164159),
('changingoil', 0.051355660417275885),
('bolt', 0.037605938997090756),
('self', 0.02675426684923848),
('quart', 0.025098667604793946),
('car', 0.02461801024730351),
('wrench', 0.02449879960884285),
('mile', 0.024309211904860725),
('hole', 0.022425222379998385),
('cylinder', 0.02225534374119694)],
48: [('homosexual', 0.04992373870889151),
('gay', 0.049510865215272606),
('man', 0.04275225543190616),
('promiscuous', 0.03971101786225638),
('dramatically', 0.03898059444397359),
('percent', 0.03716988805769234),
('sexualpartner', 0.036165596010046534),
('gaypercentage', 0.03364370542019478),
('kinseyreport', 0.031811951897593345),
('study', 0.030918351109309315)],
49: [('gateway', 0.03191204922339655),
('host', 0.029304789327306065),
('nntpposte', 0.02860578960236699),
('instal', 0.025599369309447367),
('problem', 0.025259484281414432),
('register', 0.022907850370720413),
('exception', 0.02242729809727298),
('erme', 0.021117780923349424),
('syst', 0.021117780923349424),
('buy', 0.020376923183803512)],
50: [('image', 0.04820218617393005),
('graphic', 0.030770170694722796),
('plot', 0.028907678282998957),
('plplot', 0.028301876072960665),
('package', 0.02494842406231188),
('tool', 0.023684549036871522),
('library', 0.01940251387718215),
('analysis', 0.01759715117867585),
('user', 0.017024091142944522),
('cad', 0.01658352540038222)],
51: [('simms', 0.09951154755118824),
('simm', 0.07266740424919722),
('memory', 0.06519869038972927),
('ram', 0.05024463905132314),
('chip', 0.03289808623771293),
('dram', 0.030797289918589883),
('refresh', 0.029527133112702397),
('meg', 0.027543335994726213),
('pinsimms', 0.024864540744881664),
('cycle', 0.02240759020291665)],
52: [('medicine', 0.032484360396296416),
('psychoactive', 0.03148644825088931),
('prozac', 0.03148644825088931),
('disease', 0.0298138270149792),
('patient', 0.028647930533805828),
('effect', 0.02807983170126024),
('drug', 0.027244423603769878),
('placebo', 0.026634070236029473),
('gr', 0.02392555784869406),
('ronroth', 0.02361483618816698)],
53: [('crosslinke', 0.09103812154657334),
('allocationunit', 0.0762407154427419),
('window', 0.0501502064489674),
('cfg', 0.043896714543846846),
('gfxvpic', 0.043896714543846846),
('cluster', 0.03520390471102671),
('exe', 0.03195647346637822),
('crash', 0.02943622943163868),
('keepscrashe', 0.027670297416741537),
('file', 0.02620222294963896)],
54: [('widget', 0.023471939484907892),
('available', 0.019515194089830674),
('server', 0.016024083445308947),
('pub', 0.015576587152263783),
('application', 0.014169811390045937),
('version', 0.013075610505694363),
('include', 0.012838382673676605),
('file', 0.01198965835660298),
('graphic', 0.01154331465007485),
('resource', 0.010963987745404517)],
55: [('hell', 0.053668898828743775),
('atheist', 0.03254297234343234),
('eternal', 0.02702139940619704),
('eternaldeath', 0.026692348569234574),
('believe', 0.02456989015379998),
('die', 0.021859269756740133),
('resurrection', 0.01752814454652577),
('body', 0.016582288188248105),
('human', 0.014571867421844086),
('death', 0.014529554382306941)],
56: [('trial', 0.06933193793065276),
('cooper', 0.03981411372892238),
('witness', 0.038398387794035094),
('weaver', 0.03425271512538571),
('verdict', 0.02684055124228059),
('plaintiff', 0.024259144848878783),
('spence', 0.02379864135436258),
('new', 0.0225954340312973),
('jury', 0.021438631137387015),
('court', 0.020466096177050152)],
57: [('coolingtower', 0.07288291755418987),
('water', 0.0614437688597253),
('plant', 0.05814777816846494),
('steam', 0.05406956241738616),
('uranium', 0.04376031214404917),
('cool', 0.03553702423484041),
('nuclear', 0.03358308971889222),
('reactor', 0.03270638832896486),
('nuclearsite', 0.030149828468382135),
('energy', 0.029265507822705518)],
58: [('game', 0.08557625873482862),
('segagenesis', 0.04222881997500058),
('genesis', 0.039103131942929664),
('sale', 0.038643576555622006),
('controller', 0.028923834170049845),
('trade', 0.028495643348901954),
('sne', 0.02562278751332727),
('nintendo', 0.023115580790410272),
('docsdisk', 0.02268380585287087),
('super', 0.0219612971693762)],
59: [('radardetector', 0.13409420428748786),
('detector', 0.0932187533897058),
('radar', 0.08863829691361524),
('beam', 0.03552994932171792),
('car', 0.029970765672937005),
('detect', 0.02886151305258383),
('police', 0.026993025698523552),
('speedometer', 0.02538816804091439),
('receiver', 0.024909038491534724),
('radio', 0.023712380038912465)],
60: [('mswindow', 0.06492447422990312),
('window', 0.04743095081040116),
('icon', 0.0442483869217622),
('manager', 0.02760165168014481),
('cursor', 0.023475291523754566),
('delete', 0.022915603971568433),
('group', 0.020712085401661),
('program', 0.02047556753127674),
('finetune', 0.018691456022765594),
('version', 0.01724501878966719)],
61: [('nazi', 0.052317048851381255),
('hitler', 0.03804763311521923),
('german', 0.025630914364762),
('limbaugh', 0.025047608747233184),
('party', 0.016446732327856383),
('chancellor', 0.014967845385512947),
('homosexual', 0.013368280441032816),
('history', 0.01331805065261802),
('side', 0.012989095524758307),
('himmler', 0.010485524446881665)],
62: [('mouse', 0.23851835429973228),
('com', 0.06156532114019581),
('movesmoothly', 0.0417332228791408),
('driver', 0.03519122166834939),
('jump', 0.025469722063506575),
('mousejump', 0.023054509302947848),
('verticalmotion', 0.023054509302947848),
('horizontalmotion', 0.023054509302947848),
('apple', 0.022216549892712024),
('click', 0.021025141500104687)],
63: [('compile', 0.0796015226439554),
('libxmu', 0.05489135720333669),
('symbol', 0.048397410910518315),
('error', 0.04784893378983467),
('explorationproduct', 0.041276765753900206),
('suno', 0.03695305611357901),
('sug', 0.03484990110252485),
('makefile', 0.032599081850764475),
('undefineddoug', 0.032304496392432436),
('problem', 0.030542806810446077)],
64: [('worship', 0.06839601820241566),
('sabbath', 0.059975806211859446),
('law', 0.05790627125061493),
('gentile', 0.041623743449504834),
('day', 0.03546778747298005),
('ceremonial', 0.030572728169981418),
('christian', 0.030041613802848123),
('sabbathadmission', 0.022182511923863574),
('paul', 0.021989876785142658),
('jewish', 0.02079214569943989)],
65: [('tape', 0.10519664310014572),
('disk', 0.07546218013214138),
('drive', 0.04996700444087918),
('backup', 0.043972820289968816),
('floptical', 0.03644681802474545),
('hole', 0.02677339478686914),
('floppy', 0.02501168138865748),
('nilaypatel', 0.024045612379201022),
('marker', 0.021891861237246055),
('optical', 0.020431414543603546)],
66: [('modem', 0.16014432693622221),
('baud', 0.036664227757766316),
('fax', 0.03487527990342778),
('string', 0.03474273061031691),
('firstclass', 0.0289159128547647),
('robotic', 0.028742977572820995),
('setting', 0.027307119651124166),
('duo', 0.026040527868362136),
('cable', 0.02246651266204908),
('warranty', 0.021998128402039536)],
67: [('seizure', 0.12954213075170964),
('corn', 0.10896733846046479),
('cereal', 0.07769878444433499),
('food', 0.0381151164874379),
('diet', 0.03672753242847102),
('relatedseizure', 0.03411954898348687),
('infantilespasm', 0.031183827588471737),
('kellog', 0.028192241361637397),
('disorder', 0.025854451841505456),
('sugarcoate', 0.025137758295453703)],
68: [('resurrection', 0.04569956976703852),
('rise', 0.04317784183430807),
('impact', 0.023389371446158578),
('jewish', 0.022445014345734402),
('body', 0.021458645041361177),
('roman', 0.021366574334352312),
('lie', 0.018037757406375154),
('emery', 0.017151377152668432),
('believe', 0.016447551751553324),
('lukesaccount', 0.016020186238275363)],
69: [('helmet', 0.2192452140550481),
('shoei', 0.03689432359065039),
('liner', 0.03250449907851934),
('bike', 0.029274771898348848),
('impact', 0.02899516534932512),
('passenger', 0.027333387810211055),
('primaryconcern', 0.026278840408206876),
('damage', 0.0261579014466072),
('seat', 0.02612500438594434),
('size', 0.0258145348932906)],
70: [('sale', 0.06236273733803827),
('disk', 0.046820084166751125),
('drive', 0.03426508786611373),
('apple', 0.03218318382169636),
('include', 0.030190813564505516),
('manual', 0.029368798055769568),
('rodneyjack', 0.028336901725759197),
('dbase', 0.027738696948492323),
('card', 0.025791692577726268),
('commodore', 0.025647699109219543)],
71: [('lens', 0.08970656448815056),
('camera', 0.08757099533925322),
('projector', 0.07271553813203718),
('lense', 0.061320954981221525),
('sale', 0.038243402868646956),
('sell', 0.03435626675112036),
('zoom', 0.0331572768913827),
('price', 0.032168999079498446),
('strap', 0.031601162454873835),
('video', 0.02866195944839994)],
72: [('monitor', 0.06605113540711696),
('color', 0.057121203968264825),
('screen', 0.053194548816159724),
('video', 0.049385537724491516),
('problem', 0.047896995238572854),
('apple', 0.037616708363021716),
('window', 0.03540509600676232),
('scrolling', 0.02867528112504405),
('accummulate', 0.025932596354843414),
('horizontal', 0.024963351802307934)],
73: [('claytoncramer', 0.05023087531397783),
('homosexual', 0.03968014597556204),
('sexualorientation', 0.03954729331761143),
('gay', 0.03606032721252935),
('optilinkcramer', 0.028666050348415302),
('rape', 0.026487517236246802),
('professor', 0.026108567481876125),
('sexual', 0.025376079310436686),
('female', 0.025196972268290863),
('minerelation', 0.024837564301247683)],
74: [('tiff', 0.09237370610575327),
('tiffphilosophical', 0.06058137909392974),
('significance', 0.04793757090004193),
('douglasadam', 0.024251000463952538),
('spec', 0.022941550480138122),
('gripe', 0.020018045720321037),
('alice', 0.019541611974000447),
('tully', 0.016725279836805617),
('philosophicalsignificance', 0.016725279836805617),
('question', 0.016303598351641264)],
75: [('phone', 0.10210800190203267),
('number', 0.09404855457129553),
('ozone', 0.0653247791923949),
('dial', 0.058929311336368556),
('jackmounte', 0.05384082732072073),
('greetingssituation', 0.05384082732072073),
('operator', 0.0495486430997516),
('find', 0.03630331664033296),
('line', 0.03338570327010134),
('trace', 0.0332648414068542)],
76: [('openwindow', 0.03802976048693809),
('window', 0.03736796236953475),
('problem', 0.032794036795374445),
('uart', 0.03016031775935999),
('server', 0.029883556853598416),
('com', 0.028733509371414327),
('port', 0.02258604944013815),
('card', 0.021075656380948573),
('run', 0.020971288012049002),
('patch', 0.01935440151817624)],
77: [('window', 0.13197824497757174),
('windowmanag', 0.11056598151210643),
('position', 0.06993633627496178),
('decoration', 0.056247722081152064),
('selepntr', 0.05428177414211533),
('specificcoordinate', 0.04648204310147628),
('specify', 0.04255225362670293),
('tomlastrange', 0.03718563448118103),
('sibling', 0.03718563448118103),
('tobiasdope', 0.034245891266888644)],
78: [('marriage', 0.13011600368378642),
('marry', 0.10224895746902628),
('married', 0.06032505879775998),
('ceremony', 0.04894646352438059),
('divorce', 0.037957548995177344),
('wedding', 0.03544753372540982),
('commitment', 0.03454046130980365),
('church', 0.030893054566486947),
('couple', 0.029120207262619765),
('priest', 0.02461147698747508)],
79: [('widget', 0.1455654734442187),
('gl', 0.10612163196220521),
('gadget', 0.042363728689633305),
('xmdrawingarea', 0.041117599711889316),
('application', 0.03903441883484099),
('circular', 0.03642390772606996),
('glxmdraw', 0.0360037114854424),
('motif', 0.03503352035645648),
('athenawidget', 0.031169663805091858),
('ibmrs', 0.028821118456155828)],
80: [('mormon', 0.07755856950436148),
('religion', 0.021300610235883883),
('secularauthoritie', 0.02061833682713054),
('ld', 0.019342948414348915),
('church', 0.018270086033943894),
('casperknie', 0.018253375022232024),
('sect', 0.017188112268386214),
('persecution', 0.015623764452659741),
('peteyadlowsky', 0.015500682431069387),
('rld', 0.013371512786196155)],
81: [('driver', 0.24607063821299033),
('videocard', 0.07190173992678746),
('card', 0.06487287754592932),
('color', 0.06292496266022757),
('wong', 0.05320813884102648),
('dualpage', 0.04445593861384696),
('wak', 0.04445593861384696),
('ftpsite', 0.04365728718791773),
('window', 0.04277683153896209),
('speedstar', 0.042658650636166244)],
82: [('date', 0.055123551967192076),
('timer', 0.05323847922821745),
('timing', 0.04538381666131422),
('snow', 0.03822889705131632),
('menu', 0.03575517057251557),
('pellet', 0.035279842686016215),
('ultra', 0.033243778817137686),
('battery', 0.03167888338815815),
('clock', 0.030856566838408196),
('crystal', 0.0271672116894724)],
83: [('cop', 0.06575911792088596),
('ticket', 0.04697413018421206),
('intoxicated', 0.03247070953500257),
('speedymercer', 0.029662652921589202),
('liquor', 0.028862852920002287),
('officer', 0.02848120333080181),
('court', 0.0284364324913694),
('dwi', 0.022358200984169373),
('drunkdrive', 0.021982857725844508),
('speed', 0.021567350906782023)],
84: [('coolant', 0.04808196424024301),
('heat', 0.03984173637128635),
('substitute', 0.039684509533410864),
('airconditione', 0.03769983189520441),
('freon', 0.029951394651356156),
('oven', 0.028940941272821884),
('pump', 0.02645389371224814),
('peltiereffect', 0.024818885014908143),
('retrofit', 0.023841142886180604),
('air', 0.02356229725082875)],
85: [('deficit', 0.06589492600586329),
('tax', 0.0653150467867),
('vat', 0.05879470504369579),
('taxis', 0.048375017779792535),
('capitalgain', 0.031164526247537707),
('economic', 0.02851698363387834),
('investor', 0.025617006546753633),
('spending', 0.02539515440180188),
('revenue', 0.022676799611918194),
('rate', 0.02226519874314681)],
86: [('ax', 0.05928273684496523),
('cj', 0.035578788558542095),
('rk', 0.03064514345397361),
('sj', 0.022996362074659098),
('rlk', 0.0213248489623028),
('lhz', 0.02124579702943873),
('cx', 0.02026565562133745),
('japanese', 0.01922259969654832),
('ai', 0.016100430090612076),
('vz', 0.015020523831344243)],
87: [('henrik', 0.05555200374031482),
('plane', 0.049832710685590365),
('turkishplane', 0.04866347231674578),
('armenian', 0.04654495675582742),
('azeris', 0.041031409683375215),
('shoot', 0.03538534292167865),
('homeland', 0.03304445385150035),
('forge', 0.03156661752883352),
('search', 0.0312195600188601),
('turkish', 0.030933303825198055)],
88: [('jazz', 0.06913934342690123),
('sale', 0.06888077579873401),
('rollingstone', 0.06307043023062767),
('rpmsingle', 0.05211372497053604),
('vinyl', 0.044949309539128464),
('capitolpicture', 0.0406882092623587),
('music', 0.038485348487042374),
('sleeve', 0.0377001008006893),
('promopicture', 0.03475353526364703),
('cd', 0.03383899352105777)],
89: [('motto', 0.12238060257950377),
('pompousass', 0.04237904933509004),
('thing', 0.03296994081762076),
('little', 0.0277923402145253),
('change', 0.026147577786321718),
('populationgrowth', 0.023878882099984296),
('coin', 0.023258447621310026),
('farzinmokhtarian', 0.021680482378181522),
('schneider', 0.021112941032553564),
('freedom', 0.020915221618223692)],
90: [('dog', 0.22084225914840827),
('chase', 0.0404788849679357),
('bike', 0.03859342768956505),
('ride', 0.029784723797756978),
('driveway', 0.029375218086413572),
('road', 0.019816632653833644),
('encounter', 0.0197278704014323),
('territory', 0.018900135406056087),
('attack', 0.018786511578655164),
('dispense', 0.018596945557767572)],
91: [('adjective', 0.03799328913126648),
('white', 0.03771218538748583),
('black', 0.03756651516176531),
('whitemale', 0.032935839932292446),
('redneck', 0.025335261486378803),
('africanamerican', 0.023222658987323806),
('male', 0.022030121233995184),
('loser', 0.021977092972418153),
('large', 0.020565000774837194),
('rodneyke', 0.01798526458465361)],
92: [('godshape', 0.055344082298722036),
('heart', 0.03910434523124264),
('christianity', 0.03181851405658532),
('hole', 0.031404109370420576),
('peoplesspiritual', 0.02703776507125993),
('life', 0.026915970212135914),
('atheist', 0.02556230726124682),
('infectious', 0.024110662739863568),
('drug', 0.02366517854553703),
('christian', 0.022213638802563697)],
93: [('pop', 0.08461398597439444),
('popup', 0.0635093995176973),
('dialogbox', 0.05973571503071978),
('button', 0.05691262623823576),
('window', 0.053209793998622266),
('dialog', 0.048765005087910644),
('event', 0.03381516898137644),
('time', 0.03266626590104498),
('application', 0.03179351876098632),
('program', 0.03151406054730094)],
94: [('meat', 0.06621180864303135),
('smoke', 0.06519452498353083),
('carcinogenic', 0.059866362817070064),
('barbecuedfood', 0.05422362115449456),
('healthrisk', 0.05275141661921509),
('charcoal', 0.04152590713153676),
('barbecue', 0.04108247350208514),
('wood', 0.03487352132203682),
('food', 0.03416374035544622),
('carcinogen', 0.03230720445555953)],
95: [('font', 0.1946849495098767),
('alavi', 0.05987634014930445),
('character', 0.05983101735600206),
('window', 0.05894972757941979),
('xterm', 0.035833034671444844),
('spacify', 0.03446081147583439),
('change', 0.030624038177386334),
('disappear', 0.029007190816133253),
('text', 0.02835149601784968),
('trivial', 0.027783461385210026)],
96: [('homosexuality', 0.06874318355645229),
('gay', 0.06462327650350864),
('homosexual', 0.04455533205055673),
('sex', 0.027083092817949173),
('sin', 0.020229312449639485),
('people', 0.016284034304038946),
('lesbian', 0.015093933339820913),
('community', 0.01338111412229549),
('christian', 0.013262660113294683),
('church', 0.01298454697426851)],
97: [('joystick', 0.13261781512706838),
('int', 0.0650652983391651),
('arcadestyle', 0.052517114905709227),
('game', 0.046989680849580016),
('joystickport', 0.040846644926662734),
('gamecard', 0.032153743717699225),
('button', 0.031049923773138942),
('read', 0.029566369506340045),
('augment', 0.028393304043955427),
('atari', 0.025096310212839187)],
98: [('doctor', 0.10561250490212033),
('ultrasound', 0.0838195449880376),
('radiologist', 0.07849299820762998),
('clinic', 0.04425537370213476),
('apology', 0.029955408950887736),
('patient', 0.02837734795497413),
('prostate', 0.026330939175571368),
('medical', 0.025896871186714684),
('wife', 0.023653927870514586),
('receptionist', 0.02314648043530875)],
99: [('bus', 0.12729742916987216),
('idecontroller', 0.07083387786031094),
('mhz', 0.07080966137616336),
('speed', 0.06859442502958821),
('localbus', 0.062238432330111164),
('controller', 0.054084506859578836),
('slow', 0.04111529566302196),
('ram', 0.03693448843821766),
('memory', 0.0317688263248472),
('card', 0.030915213959103988)],
100: [('translation', 0.021866289024248125),
('hebrew', 0.01881806381398455),
('greek', 0.01875415359910381),
('early', 0.017411188040827846),
('text', 0.01731113668209478),
('hang', 0.016763728354137388),
('word', 0.015064694764207702),
('inerrant', 0.014956317581484383),
('book', 0.014731040975354161),
('language', 0.014718604778796069)],
101: [('newsgateway', 0.14940457215551722),
('utexas', 0.13188503350382513),
('prolineinternet', 0.1130531614280177),
('uucpuunet', 0.07936720471405528),
('trinomial', 0.06351728937375349),
('mail', 0.053146258432030656),
('atm', 0.04980152405183907),
('host', 0.04379320064054297),
('prcgs', 0.038686232980218456),
('nntpposte', 0.03861165000587071)],
102: [('driver', 0.06469877895836006),
('card', 0.0641229020829706),
('protectionfault', 0.05119318299431993),
('atiultra', 0.04816409330151821),
('window', 0.043518626305533686),
('gateway', 0.04320233840638738),
('gatewaydx', 0.04310059577960423),
('flex', 0.04133139596981568),
('experiencedfaint', 0.035389171581685086),
('atis', 0.035389171581685086)],
103: [('order', 0.08831709994874071),
('orientaltemplar', 0.0553271917284988),
('rosicrucianord', 0.05257236007042261),
('ancient', 0.05036543952658032),
('tonyalicea', 0.04596305052181607),
('orientis', 0.035964127316645805),
('reuss', 0.03541488998207717),
('goldendawn', 0.03261513453989394),
('ordotempli', 0.027955829605623376),
('spinoff', 0.0244874366533497)],
104: [('migraine', 0.16223990073890543),
('pain', 0.09834838866464185),
('headache', 0.056667075255279656),
('exercise', 0.04143613255490441),
('gordonbank', 0.038491299707130326),
('analgesic', 0.03630595257654058),
('patient', 0.030063306714025125),
('leg', 0.02804293186847255),
('tennis', 0.026679497611496007),
('dn', 0.026679497611496007)],
105: [('wire', 0.07834341561141636),
('ground', 0.0662355416562859),
('wiring', 0.06547330376509235),
('outlet', 0.056675975611596804),
('neutral', 0.0563862526847771),
('circuit', 0.047885800899032015),
('gfci', 0.03532001646091766),
('breaker', 0.02882751075835729),
('panel', 0.027437407192693306),
('electrical', 0.02704972616731253)],
106: [('ch', 0.1072145669907265),
('aspect', 0.08944348470821777),
('group', 0.08676093860072716),
('splitpersonally', 0.07368947430042457),
('wate', 0.0688543790160629),
('graphic', 0.05962494805940736),
('convenience', 0.05488491745866772),
('forum', 0.05068458234483233),
('michaelnerone', 0.048717956618850373),
('favor', 0.04303065330717183)],
107: [('sharedmemory', 0.09984862267196293),
('server', 0.08077634616320065),
('animation', 0.07149458451622395),
('xputimage', 0.06938048157902864),
('pixmap', 0.06470934452021422),
('segment', 0.040227316283342356),
('client', 0.03982257187611942),
('extension', 0.03869337527026248),
('xview', 0.038520568497325824),
('sunview', 0.03603883125581173)],
108: [('line', 0.12278415264632199),
('calibra', 0.12117262298349488),
('hoi', 0.1119321674227654),
('nunnery', 0.10652747265778488),
('spec', 0.10244445087130769),
('oakland', 0.09971947718315788),
('netlander', 0.09971947718315788),
('thee', 0.09345728011382798),
('fli', 0.09188834529403005),
('crush', 0.07939914479165967)],
109: [('video', 0.13384789737590141),
('tape', 0.06879519563935993),
('vcr', 0.06603722014903728),
('tv', 0.04886008328086996),
('copy', 0.044028453454146355),
('quicktime', 0.04203231102340678),
('react', 0.040761481634967164),
('protection', 0.035972707943782004),
('frame', 0.03378836666370587),
('ntsc', 0.030754108677866147)],
110: [('fifthamendment', 0.06262115616779944),
('password', 0.06102458469881448),
('key', 0.05188237823383673),
('compel', 0.03443518674786847),
('disclosure', 0.031964251785412505),
('copyright', 0.030679417273406895),
('private', 0.028556812483467545),
('peanutsstrip', 0.02763921285751134),
('keyphrase', 0.02763921285751134),
('reveal', 0.02700120333854981)],
111: [('wheelie', 0.2354158180022329),
('shaft', 0.19715198950698468),
('shaftdrive', 0.10323000158279452),
('motorcycle', 0.06055770780867878),
('grind', 0.05463837647054193),
('splitfire', 0.05428070890125142),
('frontwheel', 0.04832837234959504),
('clutch', 0.0469165653787704),
('effect', 0.045750525229521215),
('imposible', 0.04320892771303779)],
112: [('drink', 0.11303233811805648),
('ride', 0.10884622443419721),
('alcohol', 0.054914861941991654),
('drinking', 0.053539043362862346),
('drinktonight', 0.041722000050816055),
('cyclingcouple', 0.041722000050816055),
('sobriety', 0.04102158272054912),
('drunk', 0.04098137921906867),
('hour', 0.040119721154805964),
('drinkshour', 0.03722832752236221)],
113: [('abortion', 0.09125053945368067),
('fetus', 0.04677938990132528),
('child', 0.036969498865850325),
('human', 0.03351139557917321),
('parent', 0.03202875352752832),
('larrymargoli', 0.026112614209208993),
('premium', 0.025905144275951048),
('life', 0.025518196299452102),
('coverage', 0.024623482397489023),
('womb', 0.02455463786058787)],
114: [('duo', 0.0777230040968972),
('problem', 0.05560793027207993),
('freeze', 0.04735561096841055),
('apple', 0.044727514705815145),
('sleep', 0.03935526181262343),
('reboot', 0.03432626483969979),
('occasionally', 0.02862217330951077),
('reset', 0.027830694997456967),
('pram', 0.0270751472889907),
('software', 0.027004449927072682)],
115: [('nickpettefar', 0.10167583903310827),
('uknewsreader', 0.07767086531116178),
('ltdmaidenhead', 0.07767086531116178),
('unitedkingdom', 0.0731611909064114),
('tinversion', 0.06943656784302209),
('incarcerate', 0.06906360985559767),
('bnrmaidenhead', 0.05220928709888071),
('pettefarcurrently', 0.0455149091463448),
('conciseoxford', 0.038651052295653535),
('gmtwibble', 0.03729234792777193)],
116: [('ancient', 0.05578348760706917),
('document', 0.03051717672089392),
('medievalperiod', 0.029333529627255744),
('book', 0.026916162822524257),
('lewis', 0.02680721760550181),
('mystery', 0.025954293114076647),
('copy', 0.025012742208721062),
('harrison', 0.024971773423624208),
('rhetoric', 0.02428126890419976),
('argument', 0.022949332444613446)],
117: [('odometer', 0.1430687919288719),
('mileage', 0.053786130836405655),
('car', 0.04743518934987012),
('electronicodometer', 0.039769055367586376),
('sensor', 0.038766876055023956),
('reading', 0.032094102259624877),
('dealer', 0.027127906414111093),
('pulse', 0.026950847135860604),
('mile', 0.025652162021292002),
('oxygensensor', 0.024438389705859053)],
118: [('lanworkplace', 0.055879838447086036),
('os', 0.04575337853620259),
('chicogo', 0.036826423862716805),
('window', 0.034611026569177736),
('do', 0.0334730383691935),
('client', 0.026560616655507435),
('app', 0.025958148909901703),
('seperate', 0.025934283700103578),
('multithreade', 0.025557801031113814),
('wfwg', 0.025557801031113814)],
119: [('moa', 0.10087376321944605),
('bmwmoa', 0.09668905379783674),
('member', 0.041187025869505844),
('membership', 0.041038032666841306),
('politic', 0.040259792367748994),
('davidkarr', 0.0333549622253739),
('humor', 0.03296744690146493),
('lapse', 0.032577563507232545),
('signing', 0.030299368060697662),
('barnacle', 0.030263486550507066)],
120: [('weight', 0.12518093688110332),
('diet', 0.08797093208120046),
('cycle', 0.08295989320790818),
('chuckforsberg', 0.08078482511331075),
('wakgx', 0.07090846929531068),
('obesity', 0.06762644803201298),
('gordonbank', 0.059294400993931975),
('obesityresearcher', 0.042177758349034644),
('oraltradition', 0.04128736684529927),
('weightgain', 0.0377130213071537)],
121: [('serb', 0.057345747872705743),
('moslem', 0.048472229348799516),
('ethniccleanse', 0.040037630019952176),
('serbiangenocide', 0.039787867598129224),
('god', 0.0357365414271381),
('work', 0.02849615524022826),
('war', 0.026333625253330283),
('bosnia', 0.02414939470447389),
('judgement', 0.023205352044012733),
('uncomfortable', 0.023062855348442823)],
122: [('eye', 0.11380356836595809),
('handedness', 0.09413049245131667),
('eyedominance', 0.09057673625080101),
('rk', 0.06846916705285444),
('contactlense', 0.05445438951105747),
('eyedness', 0.05014944959845277),
('prk', 0.04921271019033068),
('dominant', 0.046801778927292724),
('lenscorrection', 0.0439122499160461),
('richardsilver', 0.040267889739326615)],
123: [('americanoccupied', 0.12106428849800348),
('failedpresident', 0.11308440106817347),
('replacedjimmy', 0.11308440106817347),
('redundancydepartment', 0.11308440106817347),
('georgebush', 0.10938243414394261),
('carter', 0.10626976966564901),
('tenyear', 0.09662449018769294),
('opinion', 0.08586687313568053),
('employer', 0.07811672377092033),
('standard', 0.05086016147619553)],
124: [('hiramcollege', 0.12709669737018062),
('package', 0.07737522749068376),
('voucher', 0.05944740488508054),
('sale', 0.05908997759864061),
('vhsmovie', 0.056946470671591594),
('wovie', 0.0549446256029223),
('dance', 0.051255391584154805),
('douglaskou', 0.04986391021218447),
('hirambhiram', 0.04986391021218447),
('beta', 0.04968676131750533)],
125: [('lyme', 0.1591425594615977),
('treat', 0.057979370142165254),
('physician', 0.05550517572770537),
('patient', 0.051981382544939995),
('gordonbank', 0.0517641616751063),
('lymedisease', 0.050648132132180786),
('poo', 0.045661081869238264),
('diagnose', 0.044713724297398796),
('culture', 0.03805944843235583),
('ld', 0.0333173745606279)],
126: [('selectiveservice', 0.058882392902826),
('securityadmistration', 0.03999920099972593),
('drafteesfinally', 0.03999920099972593),
('volunteerarmy', 0.03999920099972593),
('utterwaste', 0.03999920099972593),
('irssocial', 0.03999920099972593),
('naval', 0.03927264178220903),
('abolish', 0.03838667193226359),
('motorvehicle', 0.03747061366543473),
('agree', 0.03728230977917745)],
127: [('virtualreality', 0.04401526794244309),
('client', 0.03365439365099483),
('diaspar', 0.029042988414331828),
('model', 0.028129811990830908),
('svr', 0.02699920017358929),
('multiverse', 0.023775083808681603),
('operation', 0.023255996477202816),
('object', 0.021599140254575554),
('virtual', 0.0211917038262767),
('provide', 0.020875158163656135)],
128: [('image', 0.07166180530040553),
('sphinx', 0.0644993041215123),
('spect', 0.042642104994902494),
('imageprocessing', 0.03733140853822152),
('imaging', 0.03519547851063035),
('package', 0.03179196243899167),
('input', 0.0298535540411922),
('signal', 0.029087717143647634),
('analysis', 0.028777698562541757),
('aprs', 0.02843711326515909)],
129: [('lucifer', 0.09571104389245019),
('logically', 0.06453763651234762),
('evil', 0.05820482385172957),
('therefore', 0.05712578677781744),
('jehovahswitnesse', 0.05125617767725834),
('mercede', 0.04938418590548362),
('syllogism', 0.04057307088256163),
('free', 0.03814360032646522),
('omniscient', 0.03706224173802248),
('omc', 0.036821289458618976)],
130: [('environment', 0.06665621687065498),
('command', 0.06593140574756158),
('file', 0.06265146583030211),
('bat', 0.05819599076331669),
('exitcode', 0.05555902686078422),
('do', 0.04630539952526526),
('set', 0.04296906648423895),
('appdefault', 0.040955979172196635),
('window', 0.03970971100349082),
('pif', 0.039698858280018075)],
131: [('exposeevent', 0.12341406333172053),
('handler', 0.09663174065448546),
('rectangle', 0.07259098588229142),
('item', 0.06556231605807203),
('window', 0.05871585094282753),
('draw', 0.051805528284627485),
('map', 0.05071530366197182),
('button', 0.048845880084468433),
('mapped', 0.043477392391006446),
('callxcopyarea', 0.042665280051964946)],
132: [('gateway', 0.07681549515691719),
('tape', 0.02632550132810915),
('service', 0.02364439045101329),
('lbl', 0.0232673726424468),
('dealer', 0.023085838125324262),
('order', 0.021066926486735398),
('peer', 0.018378204064274303),
('wawbu', 0.018166188035808283),
('retail', 0.018106335538132335),
('controller', 0.01780075968506968)],
133: [('easter', 0.13987951466947263),
('resurrection', 0.08160773944931027),
('celebration', 0.07600518365138578),
('celebrate', 0.07205434146307106),
('ishtar', 0.061595196020180584),
('word', 0.047048690869164606),
('objection', 0.03828110426347549),
('french', 0.034013723094299717),
('pagangoddess', 0.0317941272086405),
('name', 0.02927122632751491)],
134: [('holocaustmemorial', 0.13057628797003842),
('dangerousmistake', 0.12504625376964829),
('museumcostly', 0.11486088672119268),
('monument', 0.04932016069434463),
('tax', 0.04579647000925824),
('federal', 0.04098517393110056),
('exmpt', 0.04038315760657041),
('jackschmidle', 0.03948829066409946),
('educate', 0.03755267847702901),
('private', 0.035822455054855186)],
135: [('blast', 0.071927616987391),
('properequipment', 0.06905673360850827),
('batf', 0.06662798298797837),
('compound', 0.05936350629959883),
('megafire', 0.04923606336780554),
('goodfoke', 0.04431245703102498),
('protect', 0.041845145903825744),
('armoredtransport', 0.040568805869328935),
('country', 0.0380209559219227),
('wod', 0.0379060294971259)],
136: [('interrupt', 0.18878557592779646),
('port', 0.14286143551042427),
('com', 0.13556722563829707),
('mouse', 0.06182362257397565),
('modem', 0.04837481781132188),
('card', 0.04402969065379614),
('serialport', 0.04379354961948205),
('conflict', 0.04342578832399484),
('printer', 0.037099095809735666),
('pc', 0.031687844479500946)],
137: [('adl', 0.1294064044322726),
('spy', 0.0691654920241457),
('aren', 0.058326283516843),
('gerard', 0.043796214653130675),
('police', 0.03609988324778664),
('yigal', 0.0358353198579558),
('information', 0.033734992140672625),
('consideryigal', 0.03097768986891098),
('yigalaren', 0.027184802858906624),
('confidential', 0.027184165297046307)],
138: [('motherboard', 0.12567793484688464),
('slot', 0.06593291009958513),
('card', 0.061838932791374174),
('micronic', 0.04302984913982701),
('magstripe', 0.03178263286732731),
('magnetic', 0.02959496706803388),
('case', 0.028608629884068457),
('powersupply', 0.02594117764239604),
('chassis', 0.025817909483896208),
('micron', 0.02558250809055905)],
139: [('serialport', 0.17334336446373577),
('serial', 0.08482946681218848),
('device', 0.07492807245872372),
('port', 0.0700109767811955),
('connect', 0.04944489179575653),
('simultaneously', 0.04497395706199964),
('printer', 0.043509753391128114),
('swii', 0.040081406936457516),
('working', 0.03931854958151048),
('modem', 0.038295382528227055)],
140: [('comicstrip', 0.10522537805584628),
('copy', 0.07830774935773173),
('appear', 0.07760394526187306),
('annual', 0.07535806879161107),
('cover', 0.07398160888926149),
('wolverine', 0.060943595258260166),
('newmutant', 0.05597956900201285),
('art', 0.04990145439451255),
('mcfarlane', 0.04464030785252035),
('punisher', 0.04464030785252035)],
141: [('bulb', 0.10309323978340915),
('uvlight', 0.06520030218450448),
('brightness', 0.0643678652100585),
('uv', 0.06357469121693145),
('string', 0.05585920146847542),
('blinker', 0.05220010570696789),
('uvflashlight', 0.044438198504159525),
('fluorescent', 0.0407501888653153),
('cuetape', 0.03752492747794877),
('glow', 0.037115466052533595)]}
With this output, I can view all of the topics discovered across the documents.
Visualizing BERTopic and its derivatives is important for understanding the model: how it works and, more importantly, where it works. Since topic modeling can be quite a subjective field, it is difficult for users to validate their models. Inspecting the topics and checking whether they make sense goes a long way toward alleviating this issue.
model2.visualize_topics()
model2.visualize_hierarchy()
model2.visualize_barchart()
query=input('Enter the query here :')
query_embedding = model.encode(query)
Enter the query here :What is your view regarding gun control?
top_k=5
cos_scores = util.cos_sim(query_embedding, embeddings)[0]
cos_scores = cos_scores.cpu()
# using torch.topk to find the highest 5 scores
top_results = torch.topk(cos_scores, k=top_k)
print("\n\n======================\n\n")
print("Query:", query)
print("\nTop 5 Most Similar Sentences in the Corpus:\n")
for score, idx in zip(top_results[0], top_results[1]):
    print(df['text_cleaned'].values[idx], "(Score: %.4f)" % (score))
====================== Query: What is your view regarding gun control? Top 5 Most Similar Sentences in the Corpus: threaten gun_owner line nntp_poste host m article write m write story future gun_control point welcome opinion wonderful resource newsgroup take advantage thank advance feedback believe serious threat gun_owner future government liberal dea see concerned ammendment reinterpret apply armed_force bar civilian own arm kind well contribution taxis abortion elimination fetal tissue happen control type arm people allow buy type feel compel restrict military use hydrogen bomb perhaps describe hci gun_control activist determine make illegal civilian firearm personally read brady_bill entirety thank know truth truth make free (Score: 0.5477) gun_control mad tv news nntp_poste organization university line article steve_mane write know state gun_control effect homicide_rate think argue effect effect also consider negative side law_abide citizen armed pistol part prevent national crime year extreme study find number crime homicide private ownership firearm approximately live year roughly criminal homicide fatal accident involve gun year net benefit show gun_control measure disarm criminal currently use gun hard accord federal batf criminal buy gun counter gun_control law nature effect legal sale law remove benefit arm law_abide citizen minimal effect armed criminal large get gun illegally sound net benefit license weapon licensed weapon assume support reasonable law waiting_period background_check license complete ban alter statistic refer assume s support way people die fall stair accidental handgun death significant next household accident american child accidentally shoot child last year handgun_homicide child age die drown drink poisonous household chemical drano fall real goal reduce tragic accidental_death child ban drain cleaner well palce start perhaps restrict ownership professional plumber please dictionary argument rate total number re offer emphasis 
comparison call emphasis refer completely statistic sentence comparison valid put number together convince people right kind thing call propaganda cu_boulder (Score: 0.5366) ban firearm life health_science line article paul_prescod write drug ban tell supply dry drug easy manufacture easy smuggle easy hide comparison ignorant fool know drug business gun business editor freedom network international society market fax think universally act selfishly (Score: 0.5252) gun law organization canadian moderator nice summary thank talk federal try clarify bunch thing regard change canadian gun law post informational purpose question email followup still technically feasible almost impossible get tell still legal lethal force protect life also contrary officer tell gun store lock unload however regard capacity magazine still clear exempt manage province general idea exempt person receive letter form authorize possess high capacity magazine apparently authorization specify many prohibit weapon allow possess dealer allow order high capacity mag allow possess allow stock high capacity magazine convert comply new limit consider prohibit weapon amendment regulation specify possible method alter marketing reduce capacity magazine know much charge cover discuss type memory take gospel lawyer refuse play tv ofah frontenac club (Score: 0.4903) gun backcountry thank university line article write wrong whole gun protection mindset ignore systemic effect cumulative individual action want fire insurance house s prudent effect bunch paranoid pack handgun backcountry make else choose protect manner pretty king nervous re threat re affect mean take logical conclusion suppose carry handgun time protection people carry handgun collectively feel safe hell d feel lot insecure note available psych info say feeling security increase victimization stat say increase rational systemic effect good people protect bad people go modify behavior response re go much itchier much willing kill people course 
routine mugging think happen instead switch change behavior property crime s improvement even economic take unchanged sure switch kill (Score: 0.4898)
Using the torch package together with the sentence-transformers embeddings built for the BERTopic model, I can answer the query by retrieving the most similar sentences in the corpus and scoring each one against it.
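Under the hood, this retrieval step is just cosine similarity followed by a top-k selection. The following is a minimal sketch of the same mechanics in plain NumPy (the three-dimensional "embeddings" and the `top_k_similar` helper are illustrative assumptions, not part of the original pipeline, which uses real sentence-transformer vectors and `torch.topk`):

```python
import numpy as np

def top_k_similar(query_vec, corpus_vecs, k=5):
    """Return indices and cosine scores of the k corpus vectors closest to the query."""
    # Normalize both sides so a plain dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    # Sort descending by score and keep the first k indices
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]

# Toy corpus of 3-dimensional vectors standing in for document embeddings
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0]])
query = np.array([1.0, 0.1, 0.0])

idx, scores = top_k_similar(query, corpus, k=2)
print(idx)  # → [0 2]
```

In the actual notebook, `idx` would index into `df['text_cleaned']` exactly as the loop above does, printing each retrieved sentence alongside its score.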
That is all for the implementation of these topic models. Each model takes a very different approach, and which one to choose depends greatly on the problem one is trying to solve.
